Overview

Row

Introduction

Since its inception, Starbucks has grown to play an important role in the everyday lives of many people and has helped redefine coffee culture around the world. In recent years, to cater to a growing body of health-conscious people, Starbucks has introduced some drinks that are claimed to be healthier than its regular drinks.

Report’s objectives

This report aims to explore some ingredients that are included in Starbucks drinks and their relationships with the number of calories. Based on this, it also attempts to predict the number of calories in a hypothetical Starbucks drink with given amounts of some ingredients.

Report’s structure

This report consists of four parts (excluding the landing page and the references section). The first part covers the methodology used to answer the research questions. The next two parts analyse the distribution of selected ingredients in Starbucks drinks by category (caffeine/non-caffeine) and their correlation with calories. The last part builds a model that can be used to predict calories.

About the data

Data source

The data used in this report comes from the Tidy Tuesday dataset, containing nutrition facts about 93 drinks at Starbucks. These include total fat, sugar, sodium, caffeine, cholesterol, to name a few. It is drawn from the Starbucks Coffee Company Beverage Nutrition Information file published by Starbucks itself.

Limitation and possible consequences

  • The dataset only contains 93 Starbucks drinks, with several sizes provided for each drink, and the details of some drinks are not available. Hence, it does not necessarily give a comprehensive picture of the nutritional content of Starbucks drinks in practice.

  • For the above reason, the report refrains from making conclusive recommendations on the healthiest drink at Starbucks.

Methodology

Row

List of research questions and methods used

All of the following questions are explored using the same dataset, which is cleaned from the raw data source (more information can be found in the third sub-tab).

  • Research question 1: What is the distribution of sodium, total fat and total carbs in Starbucks drinks by category (caffeine/non-caffeine)?

    • Methodology: A histogram for each type of Starbucks drink, plus a summary table of the average amount of each ingredient found in each type.
  • Research question 2: What is the relationship between each of sodium, total fat and total carbs and calories?

    • Methodology: An interactive scatterplot for each ingredient, with a fitted line per type of drink, plus a summary table of the correlation coefficients that measure the strength of correlation between each ingredient and calories.
  • Research question 3: What linear regression model describes the relationship between sodium, total fat, total carbs and calories? What is the predicted number of calories in a Starbucks drink with 20mg of sodium, 10g of total fat, and 40g of total carbs?

    • Methodology: A linear regression model, evaluated using its residuals and R squared.

Notes

  • Ingredients such as sugar, saturated fat and trans fat are intentionally not considered in this report for two reasons. First, they are generally components of other ingredients in the dataset that are considered (e.g. sugar is part of total carbohydrates, and saturated fat and trans fat are parts of total fat). Including them as well would make the regressors strongly correlated with one another (multicollinearity) and complicate the regression analysis in the third part of the report. Second, given the expected output outlined in the rubric, it is best to include only highly relevant ingredients so that the final report stays concise while remaining informative.

  • All drinks considered in this report are grande size to keep comparisons consistent. Grande is the most common size in the original dataset.

  • The color choice in this report is taken from color blind friendly palettes suggested by Nichols (n.d.).

Data cleaning

1. Missingness check

Before any analysis, we check how many missing values the data contains.

Because the dataset contains 0 missing values, the analysis can proceed.

2. Data type check

We check the data type of every variable using the str() function to make sure each is stored in the right format. This shows that the trans_fat_g and fiber_g variables are stored as character, so they are converted to numeric.

3. Creation of new variable

A new column called type is created to distinguish caffeine drinks from non-caffeine drinks.

4. Size filtering

The data is filtered to keep grande-size drinks only.

Research question 1

Row

Distribution of sodium

Distribution of total fat

Distribution of total carbs

Row

Average sodium, total fat and total carbs of Starbucks drink

| type | average_sodium_mg | average_total_fat_g | average_total_carbs_g |
|---|---|---|---|
| caffeine drink | 148.08 | 6.68 | 39.40 |
| non-caffeine drink | 166.78 | 6.65 | 50.39 |

Research question 2

Column

Sodium and calories

Total fat and calories

Total carbs and calories

Row

Correlation coefficients

| type | sugar_and_calories | total_fat_and_calories | total_carbs_and_calories |
|---|---|---|---|
| caffeine drink | 0.85 | 0.82 | 0.89 |
| non-caffeine drink | 0.87 | 0.81 | 0.87 |

Research question 3

Row

Linear regression model

# A tibble: 4 × 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    17.9       2.33        7.69 1.74e- 13
2 sodium_mg       0.0430    0.0217      1.98 4.86e-  2
3 total_fat_g     9.73      0.191      51.0  1.69e-158
4 total_carbs_g   3.86      0.0841     45.9  2.45e-145
  • The intercept is 17.9, meaning that a Starbucks drink that has 0mg of sodium, 0g of total fat and 0g of total carbs has 17.9 calories on average.

  • The coefficient of sodium_mg is 0.043, meaning that on average the calories for a Starbucks drink holding the level of total fat and total carbs fixed increases by 0.043 for every extra mg of sodium.

  • The coefficient of total fat is 9.73, meaning that on average the calories for a Starbucks drink holding the level of sodium and total carbs fixed increases by 9.73 for every extra g of total fat.

  • The coefficient of total carbs is 3.86, meaning that on average the calories for a Starbucks drink holding the level of sodium and total fat fixed increases by 3.86 for every extra g of total carbs.

The model can be written as follows:

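\[\widehat{\text{Calories}} = 17.9 + 0.043 \times \text{Sodium} + 9.73 \times \text{Total fat} + 3.86 \times \text{Total carbs} \quad \text{(kcal)} \qquad (*)\]

Note: Equation (*) gives the fitted (predicted) calories, so it carries no error term; the residual is the difference between a drink's observed calories and this fitted value.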

Residuals

Evaluation and prediction

To assess the fit of the model, it is essential to examine the residuals and R squared.

In linear regression diagnostics, the residuals are assumed to satisfy four conditions: they are independent, have a mean of zero, have constant variance, and are normally distributed. In the first residual plot (full plots are available in the Residuals tab), the residuals show no clear pattern, suggesting that they are independent of each other. Many of them cluster around the horizontal line at zero, which indicates a mean of approximately zero. The residuals also appear to be normally distributed. Finally, as the plots show few outliers, the residual variance can be regarded as fairly constant.

R squared

  • R squared is a common measure of how well a linear model fits the data.
  • The R squared obtained in this case is 0.9776709, which is high. This suggests that about 97.8% of the variation in calories is explained by the chosen regressors, so the model may be a good fit. However, a high R squared on its own does not establish that the model is well specified or the best available.

Prediction

Based on the linear regression model (*), the total calories for a Starbucks drink with 20mg of sodium, 10g of total fat, and 40g of total carbs can be predicted as:

\[\widehat{\text{Calories}} = 17.9 + 0.043 \times 20 + 9.73 \times 10 + 3.86 \times 40 = 270.46~\text{(kcal)}\]

References

Column

References

[1] Arel-Bundock, V. (2022). “modelsummary: Data and Model Summaries in R.” Journal of Statistical Software, 103(1), 1-23. doi:10.18637/jss.v103.i01 https://doi.org/10.18637/jss.v103.i01.

[2] Auguie, B. (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3, https://CRAN.R-project.org/package=gridExtra.

[3] Sievert, C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida.

[4] Food Standards Australia New Zealand. (2021). Sodium and salt. Retrieved from https://www.foodstandards.gov.au/consumer/nutrition/salt/Pages/default.aspx#:~:text=What%20are%20the%20recommended%20sodium,of%20salt%20or%201%20teaspoon).

[5] Goode, K. Rey, K. (2019). ggResidpanel: Panels and Interactive Versions of Diagnostic Plots using ‘ggplot2’. R package version 0.3.0, https://CRAN.R-project.org/package=ggResidpanel.

[6] Kassambara, A. (2020). ggpubr: ‘ggplot2’ Based Publication Ready Plots. R package version 0.4.0, https://CRAN.R-project.org/package=ggpubr.

[7] Nichols, D. (n.d.). Coloring for Colorblindness. Retrieved from https://davidmathlogic.com/colorblind/#%23FFC20A-%230C7BDC.

[8] Pedersen, T. Robinson, D. (2022). gganimate: A Grammar of Animated Graphics. R package version 1.0.8, https://CRAN.R-project.org/package=gganimate.

[9] Robinson, D. Hayes, A. Couch, S. (2022). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.0, https://CRAN.R-project.org/package=broom.

[10] Sievert, C. Iannone, R. Allaire, J. Borges, B. (2022). flexdashboard: R Markdown Format for Flexible Dashboards. R package version 0.6.0, https://CRAN.R-project.org/package=flexdashboard.

[11] Silge, J. Robinson, D. (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS, 1(3). doi:10.21105/joss.00037 https://doi.org/10.21105/joss.00037.

[12] Tierney, N. (2017). “visdat: Visualising Whole Data Frames.” JOSS, 2(16), 355. doi:10.21105/joss.00355 https://doi.org/10.21105/joss.00355.

[13] Tierney, N. Cook, D. McBain, M. Fay, C. (2021). naniar: Data Structures, Summaries, and Visualisations for Missing Data. R package version 0.6.1, https://CRAN.R-project.org/package=naniar.

[14] Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

[15] Wickham, H. Averick, M. Bryan, J. Chang, W. McGowan, LD. François, R. Grolemund, G. Hayes, A. Henry, L. Hester, J. Kuhn, M. Pedersen, TL. Miller, E. Bache, SM. Müller, K. Ooms, J. Robinson, D. Seidel, DP. Spinu, V. Takahashi, K. Vaughan, D. Wilke, C. Woo, K. Yutani, H. (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686

[16] Wickham, H. Bryan, J. (2022). readxl: Read Excel Files. R package version 1.4.0, https://CRAN.R-project.org/package=readxl.

[17] Wickham, H. François, R. Henry, L. Müller, K. (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9, https://CRAN.R-project.org/package=dplyr.

[18] Wickham, H. Hester, J. Bryan, J. (2022). readr: Read Rectangular Text Data. R package version 2.1.2, https://CRAN.R-project.org/package=readr.

[19] Xie, Y. (2022). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.29.

[20] Zhu, H. (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4, https://CRAN.R-project.org/package=kableExtra.

[21] https://i.pinimg.com/originals/05/84/ae/0584aec69a54c545710acb12e0360d24.jpg

[22] https://images.thestar.com/rXuS4l_IXmSp8s7N5-V4BDha2OY=/605x799/smart/filters:cb(2700061000)/https://www.thestar.com/content/dam/thestar/life/health_wellness/nutrition/2010/10/23/five_and_15_rule_for_food_nutrients_coming/nalabels22.jpeg

[23] https://coffeeatthree.com/wp-content/uploads/starbucks-secret-menu-1.jpg

---
title: "Assignment 2"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: scroll
    storyboard: true
    social: menu
    source_code: embed
---

```{r setup, include=FALSE, message = FALSE, warning = FALSE}
library(gganimate)
library(modelsummary)
library(flexdashboard)
library(tidyverse)
library(naniar)
library(gridExtra)
library(ggResidpanel)
library(broom)
library(readxl)
library(readr)
library(kableExtra)
library(bookdown)
library(ggplot2)
library(plotly)
library(visdat)
library(dplyr)
library(tidytext)
library(ggpubr)
```

```{r read data, echo = FALSE, message = FALSE, warning = FALSE}
starbucks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv')
```

Overview {data-icon="fa-glasses"}
===================================== 

Row {.tabset data-height=600}
-----
### **Introduction**

Since its inception, Starbucks has grown to play an important role in the everyday lives of many people and has helped redefine coffee culture around the world. In recent years, to cater to a growing body of health-conscious people, Starbucks has introduced some drinks that are claimed to be healthier than its regular drinks.

**Report's objectives**

This report aims to explore some ingredients that are included in Starbucks drinks and their relationships with the number of calories. Based on this, it also attempts to predict the number of calories in a hypothetical Starbucks drink with given amounts of some ingredients.


**Report's structure**

This report consists of four parts (excluding the landing page and the references section). The first part covers the methodology used to answer the research questions. The next two parts analyse the distribution of selected ingredients in Starbucks drinks by category (caffeine/non-caffeine) and their correlation with calories. The last part builds a model that can be used to predict calories.



### **About the data**

**Data source**

The [data](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-12-21/readme.md) used in this report comes from the Tidy Tuesday dataset, containing nutrition facts about 93 drinks at Starbucks. These include total fat, sugar, sodium, caffeine, cholesterol, to name a few. It is drawn from the Starbucks Coffee Company Beverage Nutrition Information file published by Starbucks itself.

**Limitation and possible consequences**

- The dataset only contains 93 Starbucks drinks, with several sizes provided for each drink, and the details of some drinks are not available. Hence, it does not necessarily give a comprehensive picture of the nutritional content of Starbucks drinks in practice.

- For the above reason, the report refrains from making conclusive recommendations on the healthiest drink at Starbucks.



Column {.sidebar data-width=600}
----
**STARBUCKS DRINKS ANALYSIS**

Name: Minh Phuong Trang Nguyen (Ally Nguyen)

```{r}
knitr::include_graphics("https://i.pinimg.com/originals/05/84/ae/0584aec69a54c545710acb12e0360d24.jpg")
```

Image source: <https://i.pinimg.com/originals/05/84/ae/0584aec69a54c545710acb12e0360d24.jpg>

Methodology {data-icon="fa-file"}
===================================== 

Row {.tabset data-height=600}
------
### **List of research questions and methods used**

All of the following questions are explored using the same dataset, which is cleaned from the raw data source (more information can be found in the third sub-tab).

* Research question 1: What is the distribution of sodium, total fat and total carbs in Starbucks drinks by category (caffeine/non-caffeine)?

  + Methodology: A histogram for each type of Starbucks drink, plus a summary table of the average amount of each ingredient found in each type.

* Research question 2: What is the relationship between each of sodium, total fat and total carbs and calories?
  
  + Methodology: An interactive scatterplot for each ingredient, with a fitted line per type of drink, plus a summary table of the correlation coefficients that measure the strength of correlation between each ingredient and calories.

* Research question 3: What linear regression model describes the relationship between sodium, total fat, total carbs and calories? What is the predicted number of calories in a Starbucks drink with 20mg of sodium, 10g of total fat, and 40g of total carbs?

   + Methodology: A linear regression model, evaluated using its residuals and R squared.
   

### **Notes** 

* Ingredients such as sugar, saturated fat and trans fat are intentionally not considered in this report for two reasons. First, they are generally components of other ingredients in the dataset that are considered (e.g. sugar is part of total carbohydrates, and saturated fat and trans fat are parts of total fat). Including them as well would make the regressors strongly correlated with one another (multicollinearity) and complicate the regression analysis in the third part of the report. Second, given the expected output outlined in the rubric, it is best to include only highly relevant ingredients so that the final report stays concise while remaining informative.

* All drinks considered in this report are grande size to keep comparisons consistent. Grande is the most common size in the original dataset.

* The color choice in this report is taken from color blind friendly palettes suggested by Nichols (n.d.).
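
The first note, about leaving out ingredients that are components of others, can be illustrated with a small simulation. This is only a sketch on made-up data, not the Starbucks dataset: it assumes sugar is roughly 90% of total carbs plus a little noise, and computes a variance inflation factor (VIF) by hand to show how strongly such a regressor overlaps with the others.

```r
set.seed(1)

# Simulated (hypothetical) predictors: sugar is mostly a share of total carbs.
total_carbs_g <- runif(50, min = 10, max = 70)
sugar_g       <- 0.9 * total_carbs_g + rnorm(50, sd = 1)
total_fat_g   <- runif(50, min = 0, max = 20)

# VIF for sugar: regress it on the other predictors and use 1 / (1 - R^2).
r2_sugar  <- summary(lm(sugar_g ~ total_carbs_g + total_fat_g))$r.squared
vif_sugar <- 1 / (1 - r2_sugar)

vif_sugar  # values far above 10 are commonly read as severe collinearity
```

A very large VIF means the coefficient on that regressor would be estimated imprecisely, which is one reason to keep only total carbs and total fat.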



### **Data cleaning**

**1. Missingness check**

Before any analysis, we check how many missing values the data contains.

Because the dataset contains `r n_miss(starbucks)` missing values, the analysis can proceed.
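
As a rough illustration of what this check counts, the sketch below uses base R on a small made-up data frame (the `drinks` object and its values are hypothetical). `sum(is.na(...))` gives the total number of missing cells, and `colSums(is.na(...))` shows which columns they sit in.

```r
# Toy data frame standing in for the real data (values are invented).
drinks <- data.frame(
  product_name = c("latte", "mocha", "tea"),
  sodium_mg    = c(150, NA, 10),
  total_fat_g  = c(7, 15, 0)
)

# Total number of missing cells in the whole data frame.
total_missing <- sum(is.na(drinks))

# Missing cells per column, useful for locating the gaps.
missing_by_column <- colSums(is.na(drinks))

total_missing
missing_by_column
```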

**2. Data type check**

We check the data type of every variable using the `str()` function to make sure each is stored in the right format. This shows that the trans_fat_g and fiber_g variables are stored as character, so they are converted to numeric. 

```{r data type check,  echo = FALSE}
# Convert the two columns that were read in as character to numeric
starbucks$trans_fat_g <- as.numeric(starbucks$trans_fat_g)
starbucks$fiber_g <- as.numeric(starbucks$fiber_g)
```
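
One caveat with `as.numeric()` is that any string it cannot parse silently becomes `NA` (with a warning). The sketch below, on a hypothetical character vector rather than the real columns, shows one way to list the offending strings before trusting the converted values.

```r
# Hypothetical character column that should hold numbers.
fiber_raw <- c("0", "1", "2.5", "n/a")

# Unparseable strings become NA, so compare missingness before and after.
fiber_num     <- suppressWarnings(as.numeric(fiber_raw))
newly_missing <- is.na(fiber_num) & !is.na(fiber_raw)

# Strings that could not be converted.
fiber_raw[newly_missing]
```

If this returns an empty vector for a real column, the conversion did not introduce any new missing values.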

**3. Creation of new variable**

A new column called `type` is created to distinguish caffeine drinks from non-caffeine drinks.

```{r mutate,  echo = FALSE}
# Label each drink by whether it contains any caffeine
starbucks <- starbucks %>% 
  mutate(type = if_else(caffeine_mg > 0, "caffeine drink", "non-caffeine drink"))
```
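
To make the labelling rule concrete, here is a minimal sketch on a toy tibble (the drink names and caffeine values are invented for illustration). Any drink with more than 0mg of caffeine is labelled a caffeine drink; everything else is a non-caffeine drink.

```r
library(dplyr)

# Toy data with made-up caffeine values.
toy <- tibble(
  product_name = c("espresso", "steamed milk", "green tea", "lemonade"),
  caffeine_mg  = c(150, 0, 25, 0)
)

# Same rule as in the report: any caffeine at all counts as a caffeine drink.
toy_typed <- toy %>%
  mutate(type = if_else(caffeine_mg > 0, "caffeine drink", "non-caffeine drink"))

# Number of drinks in each group.
count(toy_typed, type)
```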

**4. Size filtering**

The data is filtered to keep grande-size drinks only.

```{r, echo = FALSE, warning = FALSE, message = FALSE}
starbucks_tidy <- starbucks %>%
  filter(size == "grande")
```

Column {.sidebar data-width=600}
----

```{r}
knitr::include_graphics("https://images.thestar.com/rXuS4l_IXmSp8s7N5-V4BDha2OY=/605x799/smart/filters:cb(2700061000)/https://www.thestar.com/content/dam/thestar/life/health_wellness/nutrition/2010/10/23/five_and_15_rule_for_food_nutrients_coming/nalabels22.jpeg")
```

Image source: <https://images.thestar.com/rXuS4l_IXmSp8s7N5-V4BDha2OY=/605x799/smart/filters:cb(2700061000)/https://www.thestar.com/content/dam/thestar/life/health_wellness/nutrition/2010/10/23/five_and_15_rule_for_food_nutrients_coming/nalabels22.jpeg>

Research question 1 {data-icon="fa-coffee"}
===================================== 

Row {.tabset data-height=600}
------
### **Distribution of sodium**

```{r} 
# Split the grande-size data by drink type for the per-type plots and tables
starbucks_caffeine <- starbucks_tidy %>%
  filter(type == "caffeine drink")

starbucks_non_caffeine <- starbucks_tidy %>%
  filter(type == "non-caffeine drink")
```

```{r}
plot1 <- ggplot(starbucks_caffeine, aes(x = sodium_mg)) +
  labs(x = "Total sodium (mg)", y = "Count", title = "Distribution of sodium in Starbucks caffeine drinks") +
  geom_histogram(fill = "#0C7BDC", position = "dodge", bins = 30, col=I("black")) +
  theme_bw()


plot2 <- ggplot(starbucks_non_caffeine, aes(x = sodium_mg)) +
  labs(x = "Total sodium (mg)", y = "Count", title = "Distribution of sodium in Starbucks non-caffeine drinks") +
  geom_histogram(fill = "#FFC20A", position = "dodge", bins = 30, col=I("black")) +
  theme_bw()

grid.arrange(plot1, plot2, nrow = 2)
```


### **Distribution of total fat**

```{r}
plot3 <- ggplot(starbucks_caffeine, aes(x = total_fat_g)) +
  labs(x = "Total fat (g)", y = "Count", title = "Distribution of total fat in Starbucks caffeine drinks") +
  geom_histogram(fill = "#0C7BDC", position = "dodge", bins = 30, col=I("black")) +
  theme_bw()


plot4 <- ggplot(starbucks_non_caffeine, aes(x = total_fat_g)) +
  labs(x = "Total fat (g)", y = "Count", title = "Distribution of total fat in Starbucks non-caffeine drinks") +
  geom_histogram(fill = "#FFC20A", position = "dodge", bins = 30, col=I("black")) +
  theme_bw()

grid.arrange(plot3, plot4, nrow = 2)
```

### **Distribution of total carbs**

```{r}
plot5 <- ggplot(starbucks_caffeine, aes(x = total_carbs_g)) +
  labs(x = "Total carbs (g)", y = "Count", title = "Distribution of total carbs in Starbucks caffeine drinks") +
  geom_histogram(fill = "#0C7BDC", position = "dodge", bins = 30, col=I("black")) +
  theme_bw()


plot6 <- ggplot(starbucks_non_caffeine, aes(x = total_carbs_g)) +
  labs(x = "Total carbs (g)", y = "Count", title = "Distribution of total carbs in Starbucks non-caffeine drinks") +
  geom_histogram(fill = "#FFC20A", position = "dodge", bins = 30, col=I("black")) +
  theme_bw()

grid.arrange(plot5, plot6, nrow = 2)
```


Row {data-width=600}
------
### **Average sodium, total fat and total carbs of Starbucks drink**

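```{r}
# Mean sodium, total fat and total carbs for each drink type (grande size only)
nutrient_means <- starbucks_tidy %>% 
  group_by(type) %>% 
  summarise(
    average_sodium_mg = mean(sodium_mg),
    average_total_fat_g = mean(total_fat_g),
    average_total_carbs_g = mean(total_carbs_g)
  )

# "bordered" is a bootstrap (HTML) option in kableExtra, not a LaTeX one
kable(nutrient_means, digits = 2) %>%
  kable_styling(bootstrap_options = "bordered")
```
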
Column {.sidebar data-width=450}
----
> **Notable findings**

1. Distribution of sodium

  * Overall, most Starbucks drinks (both caffeine and non-caffeine) have more than 100mg of sodium. 
  * No drink appears to exceed the recommended daily sodium intake for adults in Australia (2000mg) (Food Standards Australia New Zealand, 2021).
  * Most caffeine drinks have sodium in the range of 100-200mg, whereas the majority of non-caffeine drinks have 200mg of sodium or more.
  * These results are consistent with the average sodium levels in caffeine and non-caffeine drinks, which are 148.08mg and 166.78mg respectively.

2. Distribution of total fat

  * Overall, most Starbucks drinks (both caffeine and non-caffeine) have less than 10g of total fat. 
  * This result is consistent with the average total fat levels in caffeine and non-caffeine drinks, which are 6.68g and 6.65g respectively.
  
3. Distribution of total carbs
  * Overall, most Starbucks drinks (both caffeine and non-caffeine) have more than 20g of total carbs. 
  * Caffeine drinks have total carbs ranging widely from 20g to over 60g, whereas the majority of non-caffeine drinks have total carbs within 50-60g.
  * These results are consistent with the average total carbs levels in caffeine and non-caffeine drinks, which are 39.4g and 50.4g respectively.
  


Research question 2 {data-icon="fa-coffee"}
===================================== 

Column {.tabset data-height=550}
------

### **Sodium and calories**

```{r}
# Sodium against calories, with a separate least-squares line for each drink type
p1 <- ggplot(starbucks_tidy, aes(sodium_mg, calories)) +
  geom_point(shape = 16) +
  geom_smooth(method = "lm", aes(color = type), se = FALSE, size = 1.5, alpha = 0.4) +
  theme_bw() +
  ggtitle("Correlation between sodium and calories") +
  labs(x= "Sodium (mg)", y="Calories (KCal)", color = "Type") +
  scale_color_manual(values = c("#005AB5", "#DC3220"))
   

ggplotly(p1)
```


### **Total fat and calories**

```{r}
p2 <- ggplot(starbucks_tidy, aes(total_fat_g, calories)) +
  geom_point(shape = 16) +
  geom_smooth(method = "lm",aes(color = type), se = F, size = 1.5, alpha = 0.4) +
  theme_bw() + 
  ggtitle("Correlation between total fat and calories") +
  labs( x= "Total fat (g)", y="Calories (KCal)", color = "Type") +
  scale_color_manual(values = c("#005AB5", "#DC3220"))
  
ggplotly(p2)
```

### **Total carbs and calories**

```{r}
p3 <- ggplot(starbucks_tidy, aes(total_carbs_g, calories)) +
 geom_point(shape = 16) +
  geom_smooth(method = "lm" , aes(color = type), se = F, size = 1.5, alpha = 0.4) +
  theme_bw() + 
  ggtitle("Correlation between total carbs and calories") +
  labs( x= "Total carbs (g)", y="Calories (KCal)", color = "Type") +
  scale_color_manual(values = c("#005AB5", "#DC3220"))

ggplotly(p3)
```

Row {data-width=600}
------
### **Correlation coefficients**

```{r}
sugar_and_calories1 <- cor(starbucks_caffeine$sugar_g, starbucks_caffeine$calories)
sugar_and_calories2 <- cor(starbucks_non_caffeine$sugar_g, starbucks_non_caffeine$calories)
total_fat_and_calories1 <- cor(starbucks_caffeine$total_fat_g, starbucks_caffeine$calories)
total_fat_and_calories2 <- cor(starbucks_non_caffeine$total_fat_g, starbucks_non_caffeine$calories)
total_carbs_and_calories1 <- cor(starbucks_caffeine$total_carbs_g, starbucks_caffeine$calories)
total_carbs_and_calories2 <- cor(starbucks_non_caffeine$total_carbs_g, starbucks_non_caffeine$calories)
```

```{r}
# Assemble the correlation coefficients into one table, one row per drink type
cor_table <- data.frame(
  type = c("caffeine drink", "non-caffeine drink"),
  sugar_and_calories = c(sugar_and_calories1, sugar_and_calories2),
  total_fat_and_calories = c(total_fat_and_calories1, total_fat_and_calories2),
  total_carbs_and_calories = c(total_carbs_and_calories1, total_carbs_and_calories2)
)

# "bordered" is a bootstrap (HTML) option in kableExtra, not a LaTeX one
cor_table %>% 
  kable(digits = 2) %>% 
  kable_styling(bootstrap_options = "bordered")
```
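
The per-type coefficients above are computed one pair at a time. As an alternative sketch, the same kind of table can be produced in a single grouped step with `group_by()` and `summarise()`. The data below is invented purely to show the shape of the computation and does not reproduce the reported values.

```r
library(dplyr)

# Made-up values, only to show the shape of the computation.
toy <- tibble(
  type          = rep(c("caffeine drink", "non-caffeine drink"), each = 4),
  total_carbs_g = c(10, 20, 30, 40, 15, 25, 35, 45),
  calories      = c(60, 110, 150, 210, 90, 120, 180, 200)
)

# One Pearson correlation per drink type.
cor_by_type <- toy %>%
  group_by(type) %>%
  summarise(total_carbs_and_calories = cor(total_carbs_g, calories))

cor_by_type
```

Adding further `cor()` calls inside `summarise()` would give one column per ingredient without any intermediate variables.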

Column {.sidebar data-width=450}
----
> **Analysis and implications**

1. Analysis

  * Overall, sugar, total fat and total carbs each appear to be positively correlated with the number of calories in Starbucks drinks, for both the caffeine and non-caffeine types.
  * All of the correlations can be considered fairly strong, as they range from 0.81 to 0.89.
 

2. Implications

  * The strong correlations between sugar, total fat, total carbs and calories suggest that these three factors can affect the number of calories in a Starbucks drink. As a result, they will be used as regressors for the model that aims to predict the number of calories in research question 3.
  
  


Research question 3 {data-icon="fa-coffee"}
===================================== 

Row {.tabset data-height=600}
------
### **Linear regression model**

```{r}
calories_mm <- lm(calories ~ sodium_mg + total_fat_g + total_carbs_g, data = starbucks_tidy)
tidy(calories_mm)
```

* The intercept is 17.9, meaning that a Starbucks drink that has 0mg of sodium, 0g of total fat and 0g of total carbs has 17.9 calories on average.

* The coefficient of sodium_mg is 0.043, meaning that on average the calories for a Starbucks drink holding the level of total fat and total carbs fixed increases by 0.043 for every extra mg of sodium. 

* The coefficient of total fat is 9.73, meaning that on average the calories for a Starbucks drink holding the level of sodium and total carbs fixed increases by 9.73 for every extra g of total fat. 

* The coefficient of total carbs is 3.86, meaning that on average the calories for a Starbucks drink holding the level of sodium and total fat fixed increases by 3.86 for every extra g of total carbs.  

The model can be written as follows:

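$$\widehat{\text{Calories}} = 17.9 + 0.043 \times \text{Sodium} + 9.73 \times \text{Total fat} + 3.86 \times \text{Total carbs} \quad \text{(kcal)} \qquad (*)$$

*Note:* Equation (\*) gives the fitted (predicted) calories, so it carries no error term; the residual is the difference between a drink's observed calories and this fitted value.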
 
### **Residuals**

```{r}
resid_panel(calories_mm, plots = "SAS")
```



### **Evaluation and prediction**
 
To assess the fit of the model, it is essential to examine the residuals and R squared.  
 
In linear regression diagnostics, the residuals are assumed to satisfy four conditions: they are independent, have a mean of zero, have constant variance, and are normally distributed. In the first residual plot (full plots are available in the Residuals tab), the residuals show no clear pattern, suggesting that they are independent of each other. Many of them cluster around the horizontal line at zero, which indicates a mean of approximately zero. The residuals also appear to be normally distributed. Finally, as the plots show few outliers, the residual variance can be regarded as fairly constant.
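
The visual checks described above can be complemented with a few numbers. The sketch below is an illustration on R's built-in `mtcars` data, not on the Starbucks model; the choice of the Shapiro-Wilk test and of the correlation between absolute residuals and fitted values as a rough spread check are assumptions made for this example.

```r
# Illustrative model on a built-in dataset (not the Starbucks data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# For least squares with an intercept, the residual mean is zero by construction.
mean_resid <- mean(res)

# Shapiro-Wilk test: a small p-value would cast doubt on normality.
normality <- shapiro.test(res)

# Rough check for non-constant variance: do absolute residuals grow with fitted values?
spread_trend <- cor(abs(res), fitted(fit))

mean_resid
normality$p.value
spread_trend
```

These numbers do not replace the residual plots, but they give a quick second opinion on the same assumptions.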
 
**R squared**

* R squared is a common measure of how well a linear model fits the data. 

```{r}
r_squared <- broom::glance(calories_mm)$r.squared
```


* The R squared obtained in this case is `r r_squared`, which is high. This suggests that about 97.8% of the variation in calories is explained by the chosen regressors, so the model may be a good fit. However, a high R squared on its own does not establish that the model is well specified or the best available.  
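
To show what this statistic measures, the sketch below computes R squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with the value reported by `summary()`. It uses the built-in `mtcars` data as a stand-in, so the number itself is unrelated to the Starbucks model.

```r
# Illustrative model on a built-in dataset (not the Starbucks data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R squared = 1 - (residual sum of squares / total sum of squares).
ss_res    <- sum(residuals(fit)^2)
ss_tot    <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - ss_res / ss_tot

# Value reported by R for the same model.
r2_lm <- summary(fit)$r.squared

all.equal(r2_manual, r2_lm)
```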
 

**Prediction**

Based on the linear regression model (*), the total calories for a Starbucks drink with 20mg of sodium, 10g of total fat, and 40g of total carbs can be predicted as:

$$\widehat{\text{Calories}} = 17.9 + 0.043 \times 20 + 9.73 \times 10 + 3.86 \times 40 = 270.46~\text{(kcal)}$$ 
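
The arithmetic above can be checked with a few lines of R. This sketch uses the rounded coefficients quoted in the report rather than the fitted model object, so the vector names `coefs` and `new_drink` are introduced here only for illustration.

```r
# Rounded coefficients as reported in the regression table.
coefs <- c(intercept = 17.9, sodium_mg = 0.043, total_fat_g = 9.73, total_carbs_g = 3.86)

# The hypothetical drink from research question 3.
new_drink <- c(sodium_mg = 20, total_fat_g = 10, total_carbs_g = 40)

# Intercept plus the sum of coefficient-times-value for each regressor.
predicted <- coefs[["intercept"]] + sum(coefs[names(new_drink)] * new_drink)

round(predicted, 2)
```

With the fitted model available, `predict(calories_mm, newdata = data.frame(sodium_mg = 20, total_fat_g = 10, total_carbs_g = 40))` would do the same job using the unrounded coefficients, so its result may differ slightly from 270.46.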

Column {.sidebar data-width=600}
----

```{r}
knitr::include_graphics("https://coffeeatthree.com/wp-content/uploads/starbucks-secret-menu-1.jpg")
```
  
Image source: <https://coffeeatthree.com/wp-content/uploads/starbucks-secret-menu-1.jpg>

References {data-icon="fa-bars"}
===================================== 

Column {.tabset data-height=550}
------

**References**

[1] Arel-Bundock, V. (2022). “modelsummary: Data and Model Summaries in R.” Journal of Statistical Software, *103*(1), 1-23. doi:10.18637/jss.v103.i01 <https://doi.org/10.18637/jss.v103.i01>.

[2] Auguie, B. (2017). gridExtra: Miscellaneous Functions for "Grid" Graphics. R package version 2.3, <https://CRAN.R-project.org/package=gridExtra>.

[3] Sievert, C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida.

[4] Food Standards Australia New Zealand. (2021). Sodium and salt. Retrieved from <https://www.foodstandards.gov.au/consumer/nutrition/salt/Pages/default.aspx#:~:text=What%20are%20the%20recommended%20sodium,of%20salt%20or%201%20teaspoon)>.

[5] Goode, K. Rey, K. (2019). ggResidpanel: Panels and Interactive Versions of Diagnostic Plots using 'ggplot2'. R package version 0.3.0, <https://CRAN.R-project.org/package=ggResidpanel>.

[6] Kassambara, A. (2020). ggpubr: 'ggplot2' Based Publication Ready Plots. R package version 0.4.0, <https://CRAN.R-project.org/package=ggpubr>.

[7] Nichols, D. (n.d.). Coloring for Colorblindness. Retrieved from <https://davidmathlogic.com/colorblind/#%23FFC20A-%230C7BDC>.

[8] Pedersen, T. Robinson, D. (2022). gganimate: A Grammar of Animated Graphics. R package version 1.0.8,
  <https://CRAN.R-project.org/package=gganimate>.
  
[9] Robinson, D. Hayes, A. Couch, S. (2022). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.0,
  <https://CRAN.R-project.org/package=broom>.

[10] Sievert, C. Iannone, R. Allaire, J. Borges, B. (2022). flexdashboard: R Markdown Format for Flexible Dashboards. R package version 0.6.0, <https://CRAN.R-project.org/package=flexdashboard>.

[11] Silge, J. Robinson, D. (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS, *1*(3). doi:10.21105/joss.00037 <https://doi.org/10.21105/joss.00037>.

[12] Tierney, N. (2017). “visdat: Visualising Whole Data Frames.” JOSS, *2*(16), 355. doi:10.21105/joss.00355 <https://doi.org/10.21105/joss.00355>.

[13] Tierney, N. Cook, D. McBain, M. Fay, C. (2021). naniar: Data Structures, Summaries, and Visualisations for Missing Data. R package version 0.6.1,
  <https://CRAN.R-project.org/package=naniar>.
  
[14] Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

[15] Wickham, H. Averick, M. Bryan, J. Chang, W. McGowan, LD. François, R. Grolemund, G. Hayes, A. Henry, L. Hester, J. Kuhn, M. Pedersen, TL. Miller, E. Bache, SM. Müller, K. Ooms, J. Robinson, D. Seidel, DP. Spinu, V. Takahashi, K. Vaughan, D. Wilke, C. Woo, K. Yutani, H. (2019). “Welcome to the tidyverse.” Journal of Open Source Software, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>

[16] Wickham, H. Bryan, J. (2022). readxl: Read Excel Files. R package version 1.4.0, <https://CRAN.R-project.org/package=readxl>.
  
[17] Wickham, H. François, R. Henry, L. Müller, K. (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9,
  <https://CRAN.R-project.org/package=dplyr>.
  
[18] Wickham, H. Hester, J. Bryan, J. (2022). readr: Read Rectangular Text Data. R package version 2.1.2, <https://CRAN.R-project.org/package=readr>.

[19] Xie, Y. (2022). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.29.

[20] Zhu, H. (2021). kableExtra: Construct Complex Table with 'kable' and Pipe Syntax. R package version 1.3.4,
  <https://CRAN.R-project.org/package=kableExtra>.
  
[21] <https://i.pinimg.com/originals/05/84/ae/0584aec69a54c545710acb12e0360d24.jpg>

[22] <https://images.thestar.com/rXuS4l_IXmSp8s7N5-V4BDha2OY=/605x799/smart/filters:cb(2700061000)/https://www.thestar.com/content/dam/thestar/life/health_wellness/nutrition/2010/10/23/five_and_15_rule_for_food_nutrients_coming/nalabels22.jpeg>

[23] <https://coffeeatthree.com/wp-content/uploads/starbucks-secret-menu-1.jpg>